Conversation
|
@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples. |
61a8396 to
bcbbf9a
Compare
|
Thanks for your continued work on this @OGuggenbuehl! Some general comments. Could you:
|
Pull Request Test Coverage Report for Build 19816136481Details
💛 - Coveralls |
c7264e6 to
ad155cc
Compare
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| import re | ||
| from typing import Literal, Optional |
There was a problem hiding this comment.
Apologies we have moved to using python 3.10 types since this PR is opened. So if you could drop Optional and instead use | None instead that would be great!
There was a problem hiding this comment.
If you could also update your branch with current main then the formatting scripts and tests should catch this change for you
| current_page = doc.meta.get("page_number", 1) if doc.meta else 1 | ||
| total_pages = doc.content.count(self.page_break_character) + 1 | ||
| logger.debug( | ||
| "Processing page number: {current_page} out of {total_pages}", | ||
| current_page=current_page, | ||
| total_pages=total_pages, | ||
| ) |
There was a problem hiding this comment.
Correct me if I'm wrong but this doesn't sound quite right. The incoming document is usually a converted PDF file from a converter that hasn't yet been split. So this would mean the page_number probably doesn't exist yet in the meta data.
Either way the message "Processing page number: {current_page} out of {total_pages}" I think can be off for a few reasons. We are not just processing current_page we are processing all the pages right?
Also currently you don't offset total pages by current page so we could end up with message like "Processing page number: 10 out of 2" if page_number from meta was equal to 10 and this doc only had one page_break_character in it right?
| if not self.keep_headers and content.startswith("\n"): | ||
| content = content[1:] # remove leading newline if headers not kept |
There was a problem hiding this comment.
I'd say let's drop this update and keep the leading newline character. We should utilize a DocumentCleaner after this splitter if we want to clean up this kind of leading and trailing whitespace type characters
| # skip splits w/o content | ||
| if not content.strip(): # this strip is needed to avoid counting whitespace as content | ||
| # add as parent for subsequent headers | ||
| active_parents = [h for h in header_stack[: level - 1] if h is not None] | ||
| active_parents.append(header_text) | ||
| if self.keep_headers: |
There was a problem hiding this comment.
I think this can produce an unwanted edge-case which is if keep_headers is True then they will be added to the content below on line 136 but we never reach there if the content is empty since we will hit the continue on 127. So currently it seems to me that this if self.keep_headers: doesn't do anything since once we reach here we will always skip this match anyways
|
|
||
| # Check that content is present and correct | ||
| # Test first split | ||
| header1_doc = split_docs[0] |
There was a problem hiding this comment.
never mind I see it's there already
| def test_page_break_handling_with_multiple_headers(): | ||
| text = "# Header\nFirst page\f Second page\f Third page" |
There was a problem hiding this comment.
I like the name of the test! But I don't see multiple headers. Could we update the example to use multiple headers and sub-headers?
I think ideally you could make a copy of the sample_text fixture that also includes page breaks.
There was a problem hiding this comment.
This one is hard to defend – I think this is an artifact of me reworking and merging tests. I'll make sure to make this one consistent with its name, my bad!
| assert subheader123_doc.content == "Content under header 1.2.3." | ||
|
|
||
|
|
||
| def test_split_parentheaders(sample_text): |
There was a problem hiding this comment.
I think we can remove this test now since we test the parent_headers in test_split_without_headers and test_basic_split
| headers = {doc.meta["header"] for doc in split_docs} | ||
| assert {"Another Header", "H1", "H2"}.issubset(headers) |
There was a problem hiding this comment.
Let's not use these vague checks. Let's explicitly check that the expected doc has the expected header. So like
split_docs[X] == "Another Header"
split_docs[Y] == "H1"
...| docs = [Document(content=text)] | ||
| result = splitter.run(documents=docs) | ||
| split_docs = result["documents"] | ||
| assert len(split_docs) == 24 |
There was a problem hiding this comment.
Let's also run the same checks for this output
for i in range(1, len(split_docs)):
prev_doc = split_docs[i - 1]
curr_doc = split_docs[i]
if prev_doc.meta["header"] == curr_doc.meta["header"]: # only check overlap within same header
prev_words = prev_doc.content.split()
curr_words = curr_doc.content.split()
assert prev_words[-2:] == curr_words[:2]| docs = [Document(content=text)] | ||
| result = splitter.run(documents=docs) | ||
| split_docs = result["documents"] | ||
| assert len(split_docs) == 21 |
There was a problem hiding this comment.
Could we also add checks that the split_ids are as expected?
| assert len(split_docs[0].content.split()) == 4 # "# Header" + 2 words | ||
| assert len(split_docs[1].content.split()) == 3 # 3 words (split_length) | ||
| assert len(split_docs[2].content.split()) == 3 # 3 words (split_length) | ||
| assert len(split_docs[3].content.split()) == 2 # 2 words (meets threshold) |
There was a problem hiding this comment.
Let's also add split_id checks
There was a problem hiding this comment.
Could we also update the assertions to be string comparisons instead of lengths? That would make it easier to see if something went wrong.
| assert len(split_docs) == 3 | ||
| assert len(split_docs[0].content.split()) == 3 # 3 words | ||
| assert len(split_docs[1].content.split()) == 3 # 3 words | ||
| assert len(split_docs[2].content.split()) == 4 # 4 words (due to threshold, not possible to split 3-1) |
There was a problem hiding this comment.
let's also add split_id checks here
Co-authored-by: Sebastian Husch Lee <[email protected]>
Head branch was pushed to by a user without write access
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
Proposed Changes:
Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers
How did you test it?
unit tests
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.